Method and apparatus for determining musical notes from sounds

ABSTRACT

This method and apparatus extract symbolic high-level musical structure resembling that of a music score. Humming or the like is converted with this invention into a sequence of notes that represent the melody that the user (usually human, but potentially animal) is trying to express. These retrieved notes each contain information such as a pitch, the start time and duration and the series contains the relative order of each note. A possible application of the invention is a music retrieval system whereby humming forms the query to some search engine.

FIELD OF THE INVENTION

The present invention relates to determining musical notes from sounds, such as humming or singing. In particular it relates to converting such sounds into notes and recognising them for the purpose of music retrieval. It also relates to the component means and processes.

BACKGROUND ART

Multimedia content is an increasingly popular resource, supported by a surging market for personal digital music devices, an increase of bandwidth to the home and the emergence of 3G wireless devices. There is an increasing need for an effective searching mechanism for multimedia content. Though many systems exist for content-based retrieval of images, few mechanisms are available to retrieve the audio portion of multimedia content. One possibility for such mechanisms is retrieval by humming, whereby a user searches by humming melodies of a desired musical piece into a system. This incorporates a melody transcription technique.

FIG. 1 shows a flowchart for a known system of humming recognition. The melody transcription technique consists of a silence discriminator 101, pitch detector 102 and note extractor 103. It is assumed that each note will be separated by a reasonable amount of silence. This reduces the problem of segmentation to a silence detection problem.

In U.S. Pat. No. 6,188,010 an FFT (Fast Fourier Transform) algorithm is used to analyse sound by obtaining the frequency spectrum information from waveform data. The frequency of the voice is obtained and finally a music note that has the nearest pitch is selected.

In U.S. Pat. No. 5,874,686 an autocorrelation-based method is used to detect the pitch of each note. In order to improve the performance and robustness of the pitch-tracking algorithm, a cubic-spline wavelet transform or other suitable wavelet transform is used.

In U.S. Pat. No. 6,121,530 the onset time of the voiced sound is divided off as an onset time of each note, a time difference with an onset time of the next note is determined as the span of the note and the maximum value among the fundamental frequencies of each note contained during its span is defined as the highest pitch value.

Automatic melody transcription is the extraction of an acceptable musical description from humming. A typical humming signal consists of a sequence of audible waveforms interspersed with silence. However, there is difficulty in defining the boundary of each note in an acoustic wave and there is also considerable controversy over exactly what pitch is. Sound recognition involves using approximations. Where boundaries between notes are clear and pitch is constant, the prior art can produce reasonable results. However, that is not necessarily so where each audible waveform may contain several notes and pitch is not necessarily maintained, as happens with real people humming. A hummer's inability to maintain a pitch often results in pitch changes within a single note, which may be subsequently misinterpreted as a note change. On the other hand, if a hummer does not pause adequately when humming a string of the same notes, the transcription system might interpret it as one note. The task becomes increasingly difficult in the presence of expressive variations and the physical limitations of the human vocal system.

OBJECT AND SUMMARY OF THE INVENTION

It is therefore an aim of the present invention to provide an improved system for recognising hummed tunes or the like and to provide component processes and apparatus that can be used in such a venture.

According to a first aspect of the invention, there is provided a method for use in transcribing a musical sound signal to musical notes, comprising the steps of:

-   producing note markers, indicative of the beginnings and ends of notes in said sound signal; and
-   detecting the pitch values of notes marked by said note markers.

Preferably this method further comprises detecting portions of said sound signal that can be deemed to be silences.

This method may also further comprise the step of extracting notes from said pitch values to create note descriptors.

According to a second aspect of the invention, there is provided a method for detecting portions of a musical sound signal that can be deemed to be silences, comprising the steps of:

-   dividing said sound signal into at least one group of blocks;
-   deriving short-time energy values of said blocks in a group;
-   deriving a threshold value based on said short-time energy values; and
-   using said threshold value to classify blocks of said group as silent or otherwise.

According to a third aspect of the invention, there is provided a method of producing note markers, indicative of the beginnings and endings of notes in a musical sound signal, comprising the steps of:

-   extracting an envelope of said sound signal;
-   differentiating said envelope to compute a gradient function; and
-   extracting note markers from said gradient function, indicative of the beginnings and ends of notes in said sound signal.

The process of envelope extraction may comprise the steps of:

-   performing full-wave rectification on said sound signal; and
-   low-pass filtering the output of the full-wave rectification.

The process of differentiation may comprise the steps of:

-   determining the gradient of said envelope; and
-   low-pass filtering said gradient.

The process of note marker extraction may comprise the steps of:

-   removing small gradients from said gradient function;
-   extracting turning points of the attack and decay of remaining gradients;
-   removing unwanted attacks and decays; and
-   registering remaining attacks and decays as said note markers.

According to a fourth aspect of the invention, there is provided a method for detecting the pitch values of notes in a musical sound signal, comprising the steps of:

-   isolating notes in the sound signal;
-   dividing said notes into one or more groups of blocks;
-   deriving pitch values of said blocks; and
-   deriving the pitch values of said notes by means of clustering on said pitch values of said blocks.

This process of isolating notes may use note markers to do so.

One or more of the above aspects may be combined.

According to a fifth aspect of the invention, there is provided a method of identifying pieces of music, comprising the steps of:

-   receiving a musical sound signal imitative of a piece of music;
-   transcribing said musical sound signal to a series of musical notes and timings using the method of the first aspect above;
-   comparing said series of musical notes and timings with series of notes and timings of pieces of music in a database; and
-   identifying the piece of music deemed most similar by this comparison.

Following this, the identified piece of music may then be retrieved.

The invention is not limited to human use. It may be useful in conducting experiments with animals. Moreover, it is not limited to humming, but could be used with whistling, singing or other noise production.

The invention also provides apparatus operable according to the above methods and apparatus corresponding to the above methods.

This method and apparatus extract symbolic high-level musical structure resembling that of a music score. Humming or the like is converted with this invention into a sequence of notes that represent the melody that the user (usually human, but potentially animal) is trying to express. These retrieved notes each contain information such as a pitch, the start time and duration and the series contains the relative order of each note. A possible application of the invention is a music retrieval system whereby humming forms the query to some search engine. Music retrieval via query-by-humming can be applied to different applications such as PC, cellular phone, portable jukebox, music kiosk and car jukebox.

BRIEF DESCRIPTION OF DRAWING

The present invention is now further described by way of non-limitative example, with reference to the accompanying drawings, in which:

FIG. 1 is a flowchart of a prior art melody transcription technique;

FIG. 2 is a schematic block diagram of an embodiment of the present invention;

FIG. 3 is a flowchart of a melody transcription technique used in the embodiment of FIG. 2;

FIG. 4 is a flowchart of operation of a silence discriminator used in the embodiment of FIG. 2;

FIG. 5A is a flowchart of gradient-based segmentation used in the embodiment of FIG. 2;

FIG. 5B is an illustration of a typical humming waveform;

FIG. 5C is an illustration of the output of the envelope detector, with the waveform of FIG. 5B as the input;

FIG. 5D is an illustration of the output of the differentiator, with the waveform of FIG. 5C as the input;

FIG. 5E is an illustration of the note markers produced by the note markers extractor, with the waveform of FIG. 5D as the input;

FIG. 6 is a flowchart of operation of an envelope detector used in the embodiment of FIG. 2;

FIG. 7 is a flowchart of operation of a differentiator used in the embodiment of FIG. 2;

FIG. 8 is a schematic illustration of the criteria for selection of legitimate attack and decay;

FIG. 9 is a flowchart of operation of a note marker extractor used in the embodiment of FIG. 2;

FIG. 10 is a flowchart of a gradient threshold function used in the embodiment of FIG. 2;

FIG. 11 is a flowchart of operation of an edge detector used in the embodiment of FIG. 2;

FIG. 12 is a flowchart of operation of a pitch detector used in the embodiment of FIG. 2; and

FIG. 13 is a flowchart of operation of a dominant pitch detector used in the embodiment of FIG. 2.

SPECIFIC DESCRIPTION

A robust melody transcription system is proposed to serve as an ensemble of solutions to the problem of transcribing a humming signal to note descriptors. A melody transcription technique is used to produce note descriptors. This information is used by a feature extractor to obtain features to be used in a search engine.

FIG. 2 is a schematic block diagram of an embodiment of the present invention. A digitised humming input signal S200 from a PC, cell phone, portable jukebox, music kiosk or the like, is input into a melody transcription device 2. There it is input in parallel into a pitch detector 202, a silence discriminator 204 and a gradient-based segmentation unit 206, where it first goes into an envelope detector 208. The envelope detector 208 produces an envelope signal S210 from the humming signal, which is input into a differentiating circuit 212. Another input into this is a silence marker signal S214 from the silence discriminator 204. The output from the differentiating circuit 212 is a gradient function signal S216, which is input into a note marker extractor 218, which also receives the silence marker signal S214 from the silence discriminator 204. The note marker extractor 218 outputs a note marker signal S220, which, together with the silence marker signal S214 and humming input signal S200, is input into the pitch detector 202. The gradient-based segmentation unit 206 is made up of the envelope detector 208, the differentiating circuit 212 and the note marker extractor 218.

Using the three inputs, the pitch detector 202 produces a pitch value signal S222, from which a note extractor circuit 224 produces a note descriptor signal S226. This then is output from the melody transcription device 2. In this example, a feature extraction circuit 228 produces a feature signal S230 from the note descriptor signal S226. An MPEG-7 descriptor generator 232 uses this to produce a feature descriptor signal S234, which is fed to a search engine 236. Searching using a music database 238 gives a search result S240.

The silence discriminator 204 illustrated in FIG. 2 is employed to isolate the audible portion of the input humming signal S200 from the silence. The pitch detector 202 is used to compute the pitch of the humming input S200. The structure of the audible waveform is complex, but the present invention uses detection of an attack and decay pair as the indication that a note exists. Thus the envelope detector 208 is employed to remove the complex structure of the audible waveform. The differentiator 212 computes the gradient of the envelope S210. Another difficulty is the ambiguous nature of the attack and decay pair that symbolises the existence of a note. Unlike musical instruments, people cannot transit to the next note with a boundary that is well defined. The problem is compounded by the fact that the volume may change due to expression or failure by the hummer to maintain the volume. The volume change might create a false attack and decay within the duration of a particular note. The note marker extractor 218 is therefore used to remove all the false attacks and decays. The legitimate attack and decay pairs left are used as note markers that mark the start and end of a note. With the knowledge of the location of each note, the pitch detector 202 computes the pitch of each note. Finally, the note extractor 224 is employed to map the pitch values and note markers to produce note descriptors. A note descriptor contains information such as pitch, start time and interval of a particular note.

In this preferred embodiment, the melody transcription system comprises two distinct steps: segmentation and pitch detection. The segmentation step searches the digital signal S200 to find the start and duration of all notes that the hummer tries to express. The silence discriminator 204 isolates the voiced portions. This information has been used in the prior art to segment the digital signal. This is only feasible if a hummer inserts a certain amount of silence between each note. Most inexperienced hummers have difficulties inserting silence between notes. In this invention, a gradient-based segmentation method is employed to search for notes within the voiced portions, thus not relying so much on silence discrimination.

The humming signal is similar to an amplitude modulated (AM) signal where the volume is modulated by the pitch frequency. The pitch signal is not useful in this case, so it is removed to extract the envelope. The envelope shows some interesting properties of a typical humming signal. The envelope increases sharply from silence to a stable level. The stable level is maintained for a while before it drops back sharply to silence again. Thus an attack, followed by a steady level and a decay, is evidence of the existence of a note. The gradient-based segmentation is derived from these unique properties to extract the note markers.

These note markers are used in this invention to enhance the performance of the pitch detector 202. The approach is to exploit the fact that the pitch within each pair of start and end note markers is supposed to be constant. The signal of each note is divided into blocks of equal length. The signal in each block is assumed to be stationary and the pitch (frequency) is detected by autocorrelation. In an ideal case, these values are identical. However, the autocorrelation pitch detector 202 is sensitive to harmonics that cause errors in the detection of pitch. Furthermore, hummers frequently fail to maintain the pitch within the duration of a particular note. A k-means clustering algorithm is selected in this invention to find the prominent pitch value.
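By way of illustration only, the selection of a prominent pitch from per-block pitch values may be sketched in C as follows. The sketch assumes two clusters (k=2), one-dimensional pitch values in Hz, centroids seeded at the minimum and maximum values and a fixed iteration count; none of these choices, nor the function name dominant_pitch, is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: 1-D k-means with k = 2 (an assumption) over the
 * per-block pitch values of one note. Returns the centroid of the larger
 * cluster as the prominent pitch. */
double dominant_pitch(const double *p, int n)
{
    assert(p != NULL && n > 0);

    double lo = p[0], hi = p[0];
    for (int i = 1; i < n; i++) {
        if (p[i] < lo) lo = p[i];
        if (p[i] > hi) hi = p[i];
    }
    if (lo == hi) return lo;          /* all blocks agree */

    double c0 = lo, c1 = hi;          /* initial centroids */
    int n0 = 0, n1 = 0;
    for (int iter = 0; iter < 20; iter++) {
        double s0 = 0.0, s1 = 0.0;
        n0 = n1 = 0;
        for (int i = 0; i < n; i++) {
            double d0 = p[i] > c0 ? p[i] - c0 : c0 - p[i];
            double d1 = p[i] > c1 ? p[i] - c1 : c1 - p[i];
            if (d0 <= d1) { s0 += p[i]; n0++; }
            else          { s1 += p[i]; n1++; }
        }
        /* lo always falls in cluster 0 and hi in cluster 1,
         * so neither cluster is ever empty. */
        c0 = s0 / n0;
        c1 = s1 / n1;
    }
    return (n0 >= n1) ? c0 : c1;
}
```

With block pitches of 220, 221, 219, 220, 440 and 220 Hz, where the 440 Hz value stands for a harmonic error, the larger cluster gathers the five values near 220 Hz and its centroid is returned.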

Music retrieval by humming is perceived as an excellent complement to tactile interfaces on handheld devices, such as mobile phones and portable jukeboxes. This invention can also be employed in a ring-tone retrieval system whereby a user can download the desired ring-tone by humming to a mobile device.

Thus, in this embodiment, a user hums a tune into a microphone attached to a PC, cell phone, portable jukebox, music kiosk or the like, where the input sound is converted into a digital signal and transmitted as part of a query. The query is sent to a search engine. Melody transcription and feature extraction modules in the search engine extract relevant features. At the same time, the search engine requests MPEG-7 compliant music metadata from music metadata servers on its list. The search proceeds to match the music metadata with the features extracted from the humming query. The result is sent back to the user, with an indication of the degree of match (in the form of a score) and the location of the song(s). The user can then activate a link provided by the search engine to download or stream the song from the relevant music collection server, possibly for a price. The MPEG-7 descriptor generator is optional and depends on the application scenario.

Such a mechanism entails a robust melody transcription subsystem, which extracts symbolic high-level musical structure resembling that on a music score. Thus the humming must be converted into a sequence of notes that represent the melody that the user tries to express. The notes contain information such as the pitch, the start time and the duration of the respective notes. Thus it requires two distinct steps: the segmentation of the acoustic wave and detection of the pitch of each segment.

In the prior art shown in FIG. 1, the melody transcription technique consists of a silence discriminator, pitch detector and note extractor. FIG. 3 is a similar flowchart, showing the components of the present invention. Once again, there is a silence discriminator step 301 and a pitch detector step 304, which leads to a note extractor step 305. However, in this invention, an additional step is introduced into the conventional technique in the form of an ‘advanced mode’ option step 302, following on from silence discriminator step 301. The selection of the advanced mode activates the gradient-based segmentation step 303. This step is made up of the processes conducted in the gradient-based segmentation unit 206 of FIG. 2. Thus the process 303 searches for note markers within each voiced waveform. Note markers found are processed in the pitch detector and note extractor steps, 304 and 305 respectively.

Silence Discriminator

FIG. 4 is a flowchart of the operation of an exemplary silence discriminator 204 of FIG. 2, the silence discriminator isolating the voiced portion in the input waveform. The first step is to isolate the voiced portions from the silence portions of the digitised hum waveform. Preventing the processing of silence portions improves performance and reduces computation. A data structure is set up using the syntax of the C programming language:

    struct markers { int start; int interval; };

where markers is the struct that marks the start and the interval of the voiced portion. Thus there is an array of these markers with seg_count members.

The necessary parameters are initialised to: seg_count=0, can_start=1 and count=0, as shown in 401. The parameter can_start is initialised to ‘1’ to signal that a new marker is allowed to be created. This is to prevent creating markers before an interval of voiced portion is registered. It is followed by process 402 to compute the short-time energy function of the digitised hum waveform. The digitised hum waveform is divided into blocks of equal length. The short-time energy, E_n, for each block is computed as:

$E_{n} = \frac{1}{\mathrm{CAL\_LENGTH}} \sum_{m}^{\mathrm{CAL\_LENGTH}} \left[ x(m)\, w(n - m) \right]^{2}$

where x(m) is the discrete time audio signal, w(m) is a rectangular window function and CAL_LENGTH is the length of the window and the width of a block of hum waveform.

In order to be adaptive to different recording environments, the threshold, thres, is computed as the average of the short-time energy values and a counter is set, i=0, as shown in 403. This threshold is a reference value used to decide whether the signal at a particular time is silence or voiced. With the threshold, the short-time energy of each block is tested as shown in 404 and 405. In 404, the current short-time energy value, energy(i), is tested to determine whether its level is greater than or equal to 0.9 times the threshold and, at the same time, can_start=1. If the criteria are met, the process proceeds to step 406, where the start of the current block is registered as the start of a voiced portion. The position is calculated as:

-   markers[seg_count].start=i*CAL_LENGTH

where i is the index of the current short-time energy.

Furthermore, can_start is set to ‘−1’ to indicate that the algorithm is expecting a silence portion, hence another voiced portion cannot be registered. If, in step 404, the criteria are not met, the process goes to step 405, where the current short-time energy value, energy(i), is tested to determine whether its level is below 0.5*thres and, at the same time, can_start=−1. This is taken to mean that the beginning of a silence portion has been reached and, if these criteria are met, the interval of the voiced portion is registered in step 407. The interval is calculated as:

-   markers[seg_count].interval=i*CAL_LENGTH−markers[seg_count].start.

Following this, can_start is set to ‘1’ again to flag that the registration of a new marker is allowed and seg_count is incremented as shown in 408. The outputs of steps 406 and 408, together with the output of step 405 if the criteria are not met, rejoin in step 409, which asks if all blocks have been tested. If the answer is negative, i, the index of the current short-time energy, is incremented by 1 in step 410 and the process returns to step 404. The processes of steps 404-410 are repeated until all the values in the short-time energy function have been tested.
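By way of illustration only, the silence discriminator described above may be sketched in C as follows. The names struct markers, seg_count, can_start, thres and CAL_LENGTH follow the description; the function names short_time_energy and find_voiced, the max_out parameter and the small CAL_LENGTH value are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define CAL_LENGTH 4   /* block width in samples; illustrative value only */

struct markers {
    int start;     /* first sample of the voiced portion */
    int interval;  /* length of the voiced portion in samples */
};

/* Mean squared amplitude of block n under a rectangular window. */
static double short_time_energy(const double *x, int n)
{
    double sum = 0.0;
    for (int m = 0; m < CAL_LENGTH; m++) {
        double s = x[n * CAL_LENGTH + m];
        sum += s * s;
    }
    return sum / CAL_LENGTH;
}

/* Fills out[] with voiced-portion markers and returns seg_count. */
int find_voiced(const double *x, int len, struct markers *out, int max_out)
{
    assert(x != NULL && out != NULL);
    int blocks = len / CAL_LENGTH;
    if (blocks == 0) return 0;

    /* thres is the average short-time energy over all blocks. */
    double thres = 0.0;
    for (int i = 0; i < blocks; i++) thres += short_time_energy(x, i);
    thres /= blocks;

    int seg_count = 0, can_start = 1;
    for (int i = 0; i < blocks && seg_count < max_out; i++) {
        double e = short_time_energy(x, i);
        if (e >= 0.9 * thres && can_start == 1) {
            out[seg_count].start = i * CAL_LENGTH;
            can_start = -1;            /* now expecting silence */
        } else if (e < 0.5 * thres && can_start == -1) {
            out[seg_count].interval =
                i * CAL_LENGTH - out[seg_count].start;
            can_start = 1;             /* a new marker may be created */
            seg_count++;
        }
    }
    return seg_count;
}
```

The two different factors, 0.9 and 0.5, give the test a hysteresis, so that a block whose energy lies between the two levels does not toggle the state. As in the flowchart as described, a voiced portion that runs to the end of the signal without a following silent block is not closed by this sketch.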

Gradient Based Segmentation

The flowchart of exemplary gradient-based segmentation in this invention is shown in FIG. 5A. The humming signal is similar to an amplitude modulated (AM) signal where the volume is modulated by the pitch frequency. The pitch signal is not useful for the segmentation algorithm. Thus, the pitch frequency is removed to simplify matters. The envelope detector step 501 removes the pitch frequency. In this way, only information pertaining to the variation of volume is left. The differentiator step 502 processes this variation to produce a gradient function and removes small gradient values in the gradient function. Finally, the note marker extractor step 503 extracts note markers from the thresholded gradient function. A typical humming signal with three notes hummed is illustrated in FIG. 5B. The outputs of the envelope detector, differentiator and note markers extractor are illustrated in FIGS. 5C, 5D and 5E respectively.

Envelope Detector

FIG. 6 shows a flowchart for an exemplary envelope detector that is utilised in the gradient-based segmentation as shown in 501. The envelope detector consists of two steps: full wave rectification (processes 601 through 605) and a moving average low-pass filter.

The rectifier is simple. In step 601 a count of points in the signal, i, is set to “i=0”. The following step 602 determines if the signal level at the current signal point is greater than or equal to zero. If it is not, then, in step 603, the envelope level for that point is set to the negative of the current signal level and i is incremented by 1 in step 605. If the current signal point is greater than or equal to zero, then, in step 604, the envelope level for that point is set to the actual signal level and i is incremented by 1 again in step 605. Step 605 is followed by step 606, which determines if “i<LEN”, where LEN is a sample number, chosen here to be 200. If it is, then the process reverts to step 602. If it is not, then the process goes on to the filter.

The low pass filter is implemented by a simple moving average filter to obtain a smooth envelope of the discrete time audio signal. In spite of its simplicity, the moving average filter is optimal for common tasks such as reducing random noise while retaining a sharp step response. This property is ideal for this invention, as it is desirable to reduce the random-noise-like roughness while retaining the gradient. As the name implies, the moving average filter operates by averaging a number of points from the discrete signal to produce each point in the output signal. Thus it can be written as:

$y(t) = \frac{1}{\mathrm{ENVLEN}} \sum_{j = 0}^{\mathrm{ENVLEN} - 1} x(t + j)$

where x(t) is the discrete time audio signal with LEN samples, y(t) is the envelope signal of x(t) and ENVLEN is the number of points in the average. The ENVLEN is chosen to be 200 in this exemplary embodiment.

The process 607 initialises the necessary parameters “temp”, “i” and “j” to zero to start the filtering proper. Before proceeding to filtering, the process 608 makes sure that the filter operates within the confines of the discrete time audio signal, by checking that the sum “i+j<LEN”. The processes 609 and 610 compute the summation of all data after the current value. In particular, step 609 provides an updated temporary summation, with “temp=temp+[i+j]”. The average value of the envelope for all “i” within the sample is computed as shown in 611, “env[i]=temp/ENVLEN”. Step 612 tests whether the process of steps 608 to 611 has been repeated for all data in the input buffer and only when it has does the envelope process end. The “i” and “j” are incremented as shown in 609 and 610 respectively. The “++j” is a pre-increment, which means “j” is incremented before testing the condition. “i++” is a post-increment, which means “i” is incremented after execution of the equation shown in step 610.
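By way of illustration only, the envelope detector may be sketched in C as follows. The sketch merges the rectification and the moving average into a single pass rather than following the flowchart step by step, and the function name envelope and its parameter names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Full-wave rectification followed by a forward moving average over
 * envlen points. Near the end of the buffer fewer than envlen samples
 * are summed, but the sum is still divided by envlen, in line with
 * env[i] = temp / ENVLEN in the description. */
void envelope(const double *x, double *env, int len, int envlen)
{
    assert(x != NULL && env != NULL && envlen > 0);
    for (int i = 0; i < len; i++) {
        double temp = 0.0;
        for (int j = 0; j < envlen && i + j < len; j++) {
            double s = x[i + j];
            temp += (s >= 0.0) ? s : -s;   /* rectify */
        }
        env[i] = temp / envlen;
    }
}
```

For an input of 1, -1, 1, -1, 0, 0 and envlen of 2, the first three envelope values are 1, the fourth is 0.5 and the last two are 0, which shows the alternating pitch component removed and only the fall in volume retained.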

Differentiator

The flowchart of an exemplary differentiator is shown in FIG. 7. The differentiator consists of two steps: gradient computation and moving average low-pass filter. The differentiator processes the envelope produced by the envelope detector to generate a gradient function. The algorithm only computes the gradient values within the voiced portions marked by the markers produced by the silence discriminator. The gradient function essentially describes the changes of the input signal. This can be computed by:

$\frac{\partial y(t)}{\partial t} \approx \frac{y(t + \mathrm{GRADLEN}) - y(t)}{\mathrm{GRADLEN}}$

where y(t) is the envelope signal and GRADLEN is the offset from t to the next point used. The GRADLEN is chosen to be 20 in this exemplary embodiment.

The process is initialised in step 701. The index “j” keeps track of the segment that is being processed. The index “i” keeps track of the number of points processed within one segment. A decision 702 prevents the overflow of the buffer that contains the envelope: “i+GRADLEN” is tested against “LEN”. The gradient is computed by:

$\mathrm{Gradient} = \frac{x(i + L) - x(i)}{L}$

where “L” is the step length, for instance 100. Therefore when there is an overflow, in step 703 the x(i+L) is set to zero. When there is no buffer overflow, the gradient is computed according to the above equation in step 704. The computation in process 703 caters to the case when the gradient to be computed is near the end of the buffer. The step 705 checks whether all the gradients within the “j” voiced segment are computed. If it is true, it will proceed to step 706, else to decision 702. The step 706 increments the “j” to process the next voiced segment. The “i” is initialised to zero to start from the beginning of the segment. The decision 707 will check whether all voiced segments have been processed. It will proceed to decision 702 if not all voiced segments are processed.

The process 708 initialises the necessary parameters for the filtering operation. The filter smooths the gradient to reduce roughness. The index of the buffer is tested as shown in 709 to prevent buffer overflow. The moving average filter is chosen to smooth the gradient function. The filter is only applied to the voiced portions to reduce computation. The filter length is defined as FLEN and all data after the current value are summed as shown in 710. The index k is tested to see if it is greater than FLEN as shown in 711. The FLEN is chosen to be 200 in this invention. When the FLEN is reached, the gradient, grad, is updated as shown in 712. The process is repeated for all points inside the voiced portions, as shown in 713. The processes 709 through 714 are repeated until all voiced portions are processed.
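By way of illustration only, the gradient computation over one voiced segment may be sketched in C as follows. The subsequent moving average smoothing is omitted, and the function name gradient and its parameter names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Forward-difference gradient of the envelope y over the voiced segment
 * [start, start + interval). A sample that would lie beyond the end of
 * the buffer is taken as zero, as described for step 703. */
void gradient(const double *y, double *grad, int len,
              int start, int interval, int gradlen)
{
    assert(y != NULL && grad != NULL && gradlen > 0);
    for (int i = start; i < start + interval && i < len; i++) {
        double ahead = (i + gradlen < len) ? y[i + gradlen] : 0.0;
        grad[i] = (ahead - y[i]) / gradlen;
    }
}
```

Calling the function once per marker produced by the silence discriminator restricts the computation to the voiced portions. Positive values then correspond to the attack of a note and negative values to its decay.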

Note Markers Extractor

Ideally, there is only a pair of positive and negative gradient peaks to mark the start and end of a note. However, human humming is not ideal and the problem is further complicated by the presence of expression that causes the amplitude in a particular note to change. Thus the note markers extractor has to remove invalid gradient peaks based on predefined criteria. These criteria are derived from the assumption that each note must be marked by an attack and followed immediately by a decay. Anything in between is considered a false alarm and has to be removed. FIG. 8 illustrates exemplary criteria for selection of legitimate attack and decay. The criteria of selection of the legitimate attack and decay are based on the idea that there is only one attack and decay for each note. The 1306 marker is the legitimate attack as it is the first marker detected. Since a decay marker is expected, the 1307 marker is a false attack. Further down, the 1308 marker is temporarily considered a decay marker. It will be a legitimate decay marker if an attack marker follows it. However, a decay marker 1309 follows it. Thus, marker 1308 is discarded and the marker 1309 is temporarily considered a decay marker. The detection of the attack marker 1310 means that marker 1309 can be formally registered as a legitimate decay marker.

The flowchart in FIG. 9 shows an exemplary implementation of the abovementioned technique to remove redundant markers. The note markers extractor removes redundant ON/OFF markers and registers a set of legitimate note markers. A gradient threshold module 1001 is first called to remove small gradient values generated by the differentiator 212. It produces a train of ON/OFF pulses. An edge detector function is called to search for edges from the ON/OFF pulse starting from location 0 as shown in 1002. With the location of the nearest marker, the necessary parameters are initialised as shown in 1003. In the process 1003, pos and pg are:

-   pos: location of the legitimate attack and decay in the gradient array;
-   pg: the gradient value of the legitimate attack and decay.

The algorithm enters a loop to search and remove all redundant markers as shown in 1004 through 1015. The next edge is detected using the edge detector starting from the location of the edge found in the last search as shown in 1004. The test 1005 ensures that the edge detector has found an edge. The test 1007 covers the case when an attack marker is detected while an attack marker is registered in the previous iteration. In this case, the attack marker detected is discarded and the index is incremented to the location of the attack marker as shown in 1011. The test 1008 covers the case when a decay marker is detected and an attack marker is detected in the previous iteration. Thus, the decay marker detected is registered as a legitimate decay marker as shown in 1012. The test 1009 covers the case when a decay marker is detected but a decay marker is registered at the previous iteration. Thus, the current detected marker replaces the previous one as shown in 1013. Finally, the test 1010 covers the case when an attack marker is detected and a decay marker is detected in the previous iteration. Therefore, the attack marker is registered as shown in 1014. At a time when the edge detector is unable to find any edge, there is a final registration of markers for those still pending, as shown in 1006. Since there are no more edges, the process 1006 breaks out of the loop and continues to the process 1016. The seg_count is calculated as half of the total number of markers registered, as shown in 1016. The processes 1017 and 1018 update the markers struct with data from pos.
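By way of illustration only, the selection rule of FIG. 8 (keep the first attack and the last decay of each note) may be sketched in C as follows. The sketch works on an already extracted list of edges rather than calling the edge detector inside the loop, and it does not track the gradient values pg. The function name select_markers, the kind array and the use of -1 as a sentinel (which assumes non-negative positions) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* kind[i] is +1 for an attack (ON) edge and -1 for a decay (OFF) edge at
 * position pos[i]. Legitimate markers are written to pos_out (capacity
 * n) as alternating attack, decay, attack, decay, ... and the number of
 * complete notes, seg_count, is returned. */
int select_markers(const int *pos, const int *kind, int n, int *pos_out)
{
    assert(pos != NULL && kind != NULL && pos_out != NULL);
    int count = 0;      /* markers registered so far */
    int pending = -1;   /* latest decay seen, not yet registered */
    int in_note = 0;    /* 1 once an attack has been registered */

    for (int i = 0; i < n; i++) {
        if (kind[i] > 0) {                       /* attack edge */
            if (in_note && pending < 0)
                continue;                        /* false attack */
            if (pending >= 0) {                  /* confirm pending decay */
                pos_out[count++] = pending;
                pending = -1;
            }
            pos_out[count++] = pos[i];
            in_note = 1;
        } else if (in_note) {                    /* decay edge */
            pending = pos[i];                    /* replaces earlier decay */
        }
    }
    if (pending >= 0)
        pos_out[count++] = pending;              /* final registration */
    return count / 2;
}
```

With edges in the order attack, attack, decay, decay, attack, decay, the second attack and the first decay are discarded, which corresponds to the treatment of markers 1307 and 1308 in FIG. 8. A trailing attack with no decay after it is written to pos_out but is not counted as a note.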

Gradient Threshold

FIG. 10 shows a flowchart of a simple method to remove the unwanted small gradient values. The gradient values are tested as shown in 901. If the absolute value is less than GRADTHRES, it is set to zero as shown in 904. If the absolute value is greater than GRADTHRES and the value is positive, it is set to a positive number. If the absolute value is greater than GRADTHRES and the value is negative, it is set to a negative number. Here +10 and −10 are used respectively as an example. This process is shown in 902 through 905. In the end, the gradient threshold function produces positive and negative pulses such as those shown in 1301 through 1305.
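A minimal C sketch of this thresholding step is given below. The function name gradient_threshold and the value chosen for GRADTHRES are assumptions for illustration only, since the text gives no threshold value. A magnitude exactly equal to GRADTHRES is treated as a pulse here, which is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define GRADTHRES 0.5   /* hypothetical value; not given in the text */
#define PULSE     10.0  /* example pulse height from the text (+10 / -10) */

/* Replaces each gradient value in place by 0, +PULSE or -PULSE. */
void gradient_threshold(double *grad, size_t n)
{
    assert(grad != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        double mag = grad[i] < 0 ? -grad[i] : grad[i];
        if (mag < GRADTHRES)
            grad[i] = 0.0;                          /* small gradient removed */
        else
            grad[i] = grad[i] > 0 ? PULSE : -PULSE; /* ON or OFF pulse */
    }
}
```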

Edge Detector

The ON/OFF pulses as shown in FIG. 8 symbolise the locations of high gradients. The positive going edges of the pulses as shown by 1301 and 1302 are the locations where the gradient values transit from low to high. On the other hand, the negative going edges of the pulses as shown by 1301 and 1302 are the locations where the gradient values transit from high to low. Thus the negative going edge of the ON pulse is the turning point of the increasing envelope to a level value. The negative going edge of the ON pulse is detected using the edge detector to obtain the ON markers such as those shown in 1306 and 1307. Similarly, the positive going edge of the OFF pulse is detected using the edge detector to obtain the OFF markers such as those shown in 1308 and 1309.

FIG. 11 is a flowchart of an exemplary pulse edge detector. The pulse edge detector detects the next positive or negative edge starting from the location specified by start. The process 801 initialises the search index, i, to the desired start location. The ps is set to −1 to signal that no previous transition has been detected. A non-zero gradient and ps=−1 means that this is the first time an edge is found, as tested in 802. Therefore, ps is set to 1 to signal that the first edge has been detected, as shown in 804. When the gradient value is zero and ps=1, the second edge is detected, as tested in 803. This is a negative going edge for ON pulses and a positive going edge for OFF pulses. Having detected this edge, the current search index is returned as the edge detected, as shown in 808. The processes from 802 through 805 repeat until all data are exhausted. If all the data are exhausted, as tested in 806, and no edge is detected, a −1 is returned as in 807.

Pitch Detector

The pitch detector 202 detects the pitch of all notes registered in the markers data structure. Every note interval is divided into blocks that consist of PLEN samples. The PLEN is chosen to be 100 in this invention. Thus the pitch detection range for an audio signal sampled at 8 kHz is between 80 Hz and 8 kHz. The signal in each block is assumed to be stationary and the pitch (frequency) is detected by autocorrelation as shown below:

$r_{xx}(n) = \frac{1}{PLEN} \sum_{k=0}^{PLEN-n-1} x(k)\,x(k+n)$

where x(k) is the discrete time audio signal.

With this equation, a collection of pitch values that belong to the same note might be found. In an ideal case, these values are identical. However, the autocorrelation pitch detector is sensitive to harmonics that cause errors. Furthermore, the hummer might fail to maintain the pitch within the duration of a particular note.

FIG. 12 shows the flowchart of an exemplary pitch detector. The process 1101 computes the square of the input data. The pitch detector is an autocorrelation-based pitch detector with some modifications. The processes 1102 through 1114 compute the normalised autocorrelation function and find the pitch values of each block in a note.

A data structure is set up as described below using the syntax of the C programming language.

    struct hum_des {
        int pitch;
        int start;
        int interval;
    };

where markers is the struct that marks the start and the interval of the voiced portion. Thus there is an array of these markers with note_count members. The position and interval of a note are registered as:

-   hum_des[j].start=marker[j].start
-   hum_des[j].interval=marker[j].interval

where j is the index and 0≤j<total number of markers.

The pitch values detected may vary due to the failure of a user to maintain the pitch within a single note. The FindDom function as shown in 1116 finds the dominant pitch value. In this invention, the detected pitch values are corrected to the nearest MIDI number in 1118. The MIDI number is computed as:

$hum\_des[j].pitch = 49 + \frac{floor\left[ 12 \log\left( \frac{detected\_pitch}{440} \right) \right]}{\log 2}$

The floor(x) function returns a floating-point value representing the largest integer that is less than or equal to x. The process is repeated until all notes in the input data have their pitch detected, as shown in 1119.

Dominant Pitch Detector

The function of a dominant pitch detector is to collect statistics from the collection of pitch values to find the prominent pitch values. In this invention, the k-mean clustering method is selected to find the prominent pitch values. The k-mean clustering method does not require any prior knowledge or assumption about the data except for the number of clusters required. Determining the number of clusters is problematic in most applications. In the current invention, the clustering algorithm only needs to cluster the pitch values into two groups: the prominent cluster and the outlier cluster.

FIG. 13 is a flowchart of an exemplary dominant pitch detector (step 1117 of FIG. 12), which uses a k-mean clustering algorithm that classifies the pitches into these two groups. The k-mean clustering is an iterative algorithm for clustering data to reveal the underlying characteristic. The number of pitches is tested to check if it is greater than 3, as shown in decision 1202. If it is, the lower and upper 20% of the data are discarded to avoid portions of the note that are unstable, as shown in 1204. All the pitches are used for the computation if the number of pitches is not greater than 3. This is attained by setting “lower=0” and “upper” to the number of pitches, as shown in 1203. The centres of the two clusters are initialised to the maximum and minimum values of the data set, as shown in 1201 through 1210. The index “j” is set to lower, as shown in 1205. The process 1211 initialises the necessary parameters and saves the current centres for comparison at a later stage.

The pitch values of the note under test are contained in the array pitch. The process 1212 compares the absolute distance of the pitch value from the two centres. The pitch value is added to the accumulator called temp1 or temp2, depending on the result of the comparison, as shown in 1213 and 1214. This process repeats until all the pitch values in the note are tested, as shown in 1215. The member counts are incremented and the new centres, which are the averages of the member pitch values, are computed as shown in 1218 and 1219. The processes 1220 and 1221 test if the two centres change. If the two centres do not change, the iteration stops immediately. If there are changes in any of the centres, the iteration of the processes from 1211 through 1221 repeats until the maximum number of loops (MAXLOOP) has been reached. The maximum number of loops is 10 in this exemplary embodiment.

If the numbers of members of the two centres are close, as tested in 1223, the average of the two centres is returned as the dominant pitch. If they are not close enough, the centre with the larger number of members is returned as the dominant pitch, as shown in 1225 through 1227. In this way, the cluster with the highest number of members is classified as the prominent cluster while the other cluster is classified as the outlier cluster. The pitch of the note is set to the centre of the prominent cluster.
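The two-cluster procedure described above can be sketched in C as follows. This is an illustration under stated assumptions rather than the flowchart of FIG. 13: the function name find_dominant is hypothetical, the 20% trimming is done by integer division of the number of pitches, and the text does not say how close the member counts must be, so CLOSE_COUNT is an assumed value. MAXLOOP, temp1, temp2, lower and upper follow the text.

```c
#include <assert.h>
#include <stddef.h>

#define MAXLOOP     10  /* maximum number of iterations, from the text */
#define CLOSE_COUNT 1   /* hypothetical: counts this close are "close" */

static double absd(double v) { return v < 0 ? -v : v; }

/* Returns the dominant pitch of the n block pitch values of one note. */
double find_dominant(const double *pitch, size_t n)
{
    size_t lower = 0, upper = n, n1 = 0, n2 = 0;

    assert(pitch != NULL && n > 0);
    if (n > 3) {                /* discard the lower and upper 20% */
        lower = n / 5;
        upper = n - n / 5;
    }
    /* Initialise the centres to the maximum and minimum values. */
    double c1 = pitch[lower], c2 = pitch[lower];
    for (size_t j = lower; j < upper; j++) {
        if (pitch[j] > c1) c1 = pitch[j];
        if (pitch[j] < c2) c2 = pitch[j];
    }
    for (int loop = 0; loop < MAXLOOP; loop++) {
        double temp1 = 0.0, temp2 = 0.0, old1 = c1, old2 = c2;
        n1 = n2 = 0;
        for (size_t j = lower; j < upper; j++) {
            /* Assign each pitch to the nearer centre. */
            if (absd(pitch[j] - c1) <= absd(pitch[j] - c2)) {
                temp1 += pitch[j];
                n1++;
            } else {
                temp2 += pitch[j];
                n2++;
            }
        }
        if (n1 > 0) c1 = temp1 / (double)n1;   /* new centres are averages */
        if (n2 > 0) c2 = temp2 / (double)n2;
        if (c1 == old1 && c2 == old2)
            break;                             /* centres did not change */
    }
    size_t diff = n1 > n2 ? n1 - n2 : n2 - n1;
    if (diff <= CLOSE_COUNT)
        return (c1 + c2) / 2.0;                /* counts close: average */
    return n1 > n2 ? c1 : c2;                  /* larger cluster wins */
}
```

For ten block pitches in which most values lie near 220 and one stray value is 440, the stray value ends up alone in the outlier cluster and the centre of the larger cluster is returned.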

It is in fact possible for the invention to work without the silence discriminator.

Note extraction is a simple module that gathers information from the note marker generator and the pitch detector. It then fills a structure that describes the start time, the duration and the pitch value of each note. Feature extraction converts the note descriptors to features that are used by the search engine. The current feature is the melody contour that is specified in the MPEG-7 standard. The description generation is an optional module that converts the feature to a format for storage or transmission.

Effects of Invention

The invention achieves the conversion of human (or animal, e.g. dolphin) humming, singing, whistling or other musical noises to musical notes. The gradient-based segmentation goes beyond the traditional segmentation method that relies on silence. This means that the user can hum naturally without consciously trying to pause between notes, which may not be easy for some users with little musical background. The modified autocorrelation-based pitch detector can tolerate a user's failure to maintain pitch within a single note.

While exemplary means of achieving the particular component processes have been illustrated, other means achieving similar ends can readily be incorporated.

1-58. (canceled)
59. A method for detecting the pitch values of notes in a musical sound signal, comprising the steps of (a) isolating notes in the sound signal; (b) dividing said notes into one or more groups of blocks; (c) deriving pitch values of said blocks; and (d) deriving the pitch values of said notes by means of clustering on said pitch values of said blocks.
60. A method according to claim 59, wherein the process of isolating notes uses note markers to do so.
61. A method according to claim 59, wherein the blocks in a group are of equal length.
62. A method according to claim 59, wherein each group contains the same number of blocks.
63. A method according to claim 59, wherein the process of deriving the pitch values comprises applying k-mean clustering on the block pitch values.
64. A method according to claim 59, further comprising the step (e) of rounding the detected pitch values of the notes to the nearest note values.
65. A method according to claim 59, wherein the note isolating step is performed based on a determination of silences in the musical sound signal.
66. A method according to claim 59, wherein the note isolating step is performed based on a determination of note markers in the musical sound signal.
67. A method according to claim 63, further comprising the step of extracting notes from said pitch values to create note descriptors.
68. A method according to claim 59, wherein the musical sound signal is digitised.
69. A method according to claim 59, wherein the musical sound signal is an audio signal of a sound produced by a person.
70. A method according to claim 69, wherein the sound comprises one or more of the group of: humming, singing and whistling at least a portion of a piece of music.
71. Apparatus for use in detecting the pitch values of notes in a musical sound signal, operable according to the method of claim 59.
72. Apparatus for detecting the pitch values of notes in a musical sound signal, comprising: (a) note isolating means for isolating notes in the sound signal; (b) pitch value dividing means for dividing said notes into one or more groups of blocks; (c) block pitch value deriving means for deriving pitch values of said blocks; and (d) note pitch value deriving means for deriving the pitch values of said notes by means of clustering on said pitch values of said blocks.
73. Apparatus according to claim 72, wherein said note isolating means uses note markers to isolate notes.
74. Apparatus according to claim 72, wherein the blocks in a group are of equal length.
75. Apparatus according to claim 72, wherein each group contains the same number of blocks.
76. Apparatus according to claim 72, wherein the note pitch value deriving means is operable to apply k-mean clustering on the block pitch values.
77. Apparatus according to claim 72, further comprising rounding means for rounding the detected pitch values of the notes to the nearest note values.
78. Apparatus according to claim 72, wherein the note isolating means operates based on a determination of silences in the musical sound signal.
79. Apparatus according to claim 72, wherein the note isolating means operates based on a determination of note markers in the musical sound signal.
80. Apparatus according to claim 76, further comprising note extracting means for extracting notes from said pitch values to create note descriptors.
81. Apparatus according to claim 72, operable to process a digital musical sound signal.
82. Apparatus according to claim 72, operable to process a musical sound signal being an audio signal of a sound produced by a person.
83. Apparatus according to claim 82, wherein the sound comprises one or more of the group of humming, singing and whistling at least a portion of a piece of music.
84. Software which, when loaded, is operable according to the method of claim 59.
85. A memory device containing software according to claim 84.
86. A computer having loaded therein, software according to claim 84.